Exploratory Data Analysis (EDA) is the first and crucial step when working with any dataset. It allows us to familiarize ourselves with the data by summarizing its main characteristics, using graphs and visualizations, and forming informed hypotheses. EDA helps in driving the selection of features for modeling by understanding the data and uncovering patterns, trends, missing values, anomalies, or relationships that guide further analysis.
For this tutorial, we will use the Titanic dataset, a classic dataset in data science and machine learning. The Titanic dataset provides information about the passengers who were aboard the RMS Titanic, which tragically sank on its maiden voyage in 1912.
This dataset is rich in both categorical and numerical variables, making it ideal for practicing Exploratory Data Analysis (EDA). It includes information such as passenger class, name, sex, age, the number of siblings/spouses and parents/children aboard, ticket number, fare, cabin, and port of embarkation.
The Titanic dataset is widely used for tutorials because it offers a variety of data types and challenges, such as missing values, categorical data, and potential interactions between variables. It’s a great way to practice EDA techniques and gain insights that can guide more advanced analyses or predictive modeling.
Throughout this tutorial, we will perform an Exploratory Data Analysis on the Titanic dataset to understand the characteristics of the passengers, explore survival rates, and identify key factors that might have influenced survival.
install.packages("tidyverse")
## The following package(s) will be installed:
## - tidyverse [2.0.0]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing tidyverse ... OK [linked from cache]
## Successfully installed 1 package in 15 milliseconds.
install.packages("plyr")
## The following package(s) will be installed:
## - plyr [1.8.9]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing plyr ... OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("patchwork")
## The following package(s) will be installed:
## - patchwork [1.2.0]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing patchwork ... OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("reshape2")
## The following package(s) will be installed:
## - reshape2 [1.4.4]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing reshape2 ... OK [linked from cache]
## Successfully installed 1 package in 13 milliseconds.
install.packages("GGally")
## The following package(s) will be installed:
## - GGally [2.2.1]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing GGally ... OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("factoextra")
## The following package(s) will be installed:
## - factoextra [1.0.7]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
##
## # Installing packages --------------------------------------------------------
## - Installing factoextra ... OK [linked from cache]
## Successfully installed 1 package in 14 milliseconds.
library(patchwork)
library(plyr)
library(tidyverse)
library(dplyr)
library(tidyr)
library(reshape2)
library(GGally)
library(factoextra)
library(readr)
titanic_data <- read_csv("titanic.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(titanic_data)
## # A tibble: 6 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran,… male NA 0 0 330877 8.46 <NA>
## # ℹ 1 more variable: Embarked <chr>
The str() function provides a concise summary of the dataset, showing the number of observations (rows), the number of variables (columns), and the data type of each variable (e.g. numeric, character, factor).
str(titanic_data)
## spc_tbl_ [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr [1:891] "male" "female" "female" "female" ...
## $ Age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr [1:891] NA "C85" NA "C123" ...
## $ Embarked : chr [1:891] "S" "C" "S" "S" ...
## - attr(*, "spec")=
## .. cols(
## .. PassengerId = col_double(),
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. SibSp = col_double(),
## .. Parch = col_double(),
## .. Ticket = col_character(),
## .. Fare = col_double(),
## .. Cabin = col_character(),
## .. Embarked = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
To get a quick summary of each variable, including basic descriptive statistics for numerical variables and frequency counts for factors, use the summary() function.
summary(titanic_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
colSums(is.na(titanic_data))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
Here we can see that the columns Age, Cabin, and Embarked contain missing values; we will need to address these during data cleaning.
Before diving deeper into analysis, it’s crucial to ensure that your data is clean, consistent, and ready for exploration. Data cleaning and preparation is a foundational step in any data analysis project, where we address issues like missing values, inconsistencies, and incorrect data types. This process not only improves the quality of your analysis but also helps in drawing more accurate and reliable conclusions.
In this chapter, we will walk through the essential techniques and best practices for cleaning and preparing your dataset. By the end of this section, you’ll be equipped to handle common data challenges and transform raw data into a structured format that’s ready for analysis.
Common approaches to dealing with missing data include deleting the affected rows or columns, imputing missing values with a summary statistic (mean, median, or mode), and model-based imputation (predicting the missing values from the other variables).
What do you think are their pros & cons?
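To make the trade-offs concrete, here is a minimal sketch of the first three approaches on a hypothetical toy data frame `df` (the names `df_drop`, `df_mean`, and `df_median` are our own):

```r
library(dplyr)
library(tidyr)

# hypothetical toy data frame with one missing value
df <- tibble(x = c(1, 2, NA, 4))

# deletion: drop rows with a missing x (simple, but loses data)
df_drop <- df %>% drop_na(x)

# mean imputation: preserves the mean, but shrinks the variance
df_mean <- df %>% mutate(x = replace_na(x, mean(x, na.rm = TRUE)))

# median imputation: more robust to outliers than the mean
df_median <- df %>% mutate(x = replace_na(x, median(x, na.rm = TRUE)))
```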
Age column
# let's first examine the distribution of values in the Age column
library(ggplot2)
ggplot(titanic_data, aes(x=Age)) + geom_histogram(bins = 30)
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
Notice the warning caused by the 177 missing values.
qqnorm(titanic_data$Age)
qqline(titanic_data$Age, col="steelblue", lwd = 2)
We see that the distribution of Age is skewed to the right (positive skew), so the mean would be pulled toward the tail. We will therefore use the most common value, the mode, to fill in the missing ones.
# create frequency table and get mode
frequency_table <- table(titanic_data$Age)
mode_value <- as.numeric(names(frequency_table)[which.max(frequency_table)])
mode_value
## [1] 24
Unfortunately, R has no built-in function for the statistical mode (the mode() function returns the storage mode of an object instead), so we create a frequency table of all values in the Age column and pick the most frequent one, which in this case is 24.
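Since computing the mode is a pattern worth reusing, the same logic can be wrapped in a small helper function; a sketch (the name `get_mode` is our own):

```r
# return the most frequent value of a vector, ignoring NAs
get_mode <- function(x) {
  x <- x[!is.na(x)]
  freq <- table(x)
  # table() stores values as character names, so convert back if numeric
  result <- names(freq)[which.max(freq)]
  if (is.numeric(x)) as.numeric(result) else result
}

get_mode(c(1, 2, 2, 3, NA))  # 2
```

Note that in case of ties, which.max() simply returns the first of the tied values.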
# replace missing values in the Age column with the mode
titanic_data <- titanic_data %>% mutate(Age = replace_na(Age, mode_value))
sum(is.na(titanic_data$Age))
## [1] 0
ggplot(titanic_data, aes(x=Age)) + geom_histogram(bins = 30)
The histogram after imputation shows that we modified the distribution significantly: the bin containing the mode now towers over the rest. This is a compromise we accept because, in this case, the mode is our best guess.
687 out of 891 values in the Cabin column are missing, which tells us that this column is not very informative for predicting survival. For this specific case, we will remove the column from the dataset entirely.
# drop the Cabin column
titanic_data <- titanic_data %>% select(-Cabin)
head(titanic_data)
## # A tibble: 6 × 11
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 0 3 Braund, Mr. … male 22 1 0 A/5 2… 7.25
## 2 2 1 1 Cumings, Mrs… fema… 38 1 0 PC 17… 71.3
## 3 3 1 3 Heikkinen, M… fema… 26 0 0 STON/… 7.92
## 4 4 1 1 Futrelle, Mr… fema… 35 1 0 113803 53.1
## 5 5 0 3 Allen, Mr. W… male 35 0 0 373450 8.05
## 6 6 0 3 Moran, Mr. J… male 24 0 0 330877 8.46
## # ℹ 1 more variable: Embarked <chr>
The Embarked column is a categorical one with three possible values: S, C or Q. The most frequent one is S (644 out of 889 non-missing values). We will replace the two missing values with the most frequent one.
ggplot(titanic_data, aes(x = Embarked)) + geom_bar()
# replace the missing values in the Embarked column
count_table <- titanic_data %>% plyr::count("Embarked")
most_frequent <- count_table %>% filter(freq == max(freq)) %>% pull("Embarked")
titanic_data <- titanic_data %>% mutate(Embarked = replace_na(Embarked, most_frequent))
sum(is.na(titanic_data$Embarked))
## [1] 0
ggplot(titanic_data, aes(x = Embarked)) + geom_bar()
Finally let’s convert the column type to categorical.
titanic_data$Embarked <- as.factor(titanic_data$Embarked)
Data transformation is a crucial step in data preparation and analysis. It involves modifying, reshaping, or aggregating your data to make it more suitable for analysis. This process allows you to derive new insights, prepare data for modeling, and ensure that the dataset is in the right format for visualization or further processing.
# check if columns are named consistently
colnames(titanic_data)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Embarked"
# some of the column names are not self-explanatory, some are also shortened, let's unify them
titanic_data <- titanic_data %>% rename(passenger_id = PassengerId, survived = Survived, passenger_class = Pclass, name = Name, sex = Sex, age = Age, siblings_spouses = SibSp, parents_children = Parch, ticket = Ticket, fare = Fare, embarked = Embarked)
colnames(titanic_data)
## [1] "passenger_id" "survived" "passenger_class" "name"
## [5] "sex" "age" "siblings_spouses" "parents_children"
## [9] "ticket" "fare" "embarked"
# the intention here is to minimize the number of unique values in each column to reduce dimensionality
lapply(titanic_data %>% select(-passenger_id, -name, -ticket), unique) # exclude columns that obviously have a unique value in each row
## $survived
## [1] 0 1
##
## $passenger_class
## [1] 3 1 2
##
## $sex
## [1] "male" "female"
##
## $age
## [1] 22.00 38.00 26.00 35.00 24.00 54.00 2.00 27.00 14.00 4.00 58.00 20.00
## [13] 39.00 55.00 31.00 34.00 15.00 28.00 8.00 19.00 40.00 66.00 42.00 21.00
## [25] 18.00 3.00 7.00 49.00 29.00 65.00 28.50 5.00 11.00 45.00 17.00 32.00
## [37] 16.00 25.00 0.83 30.00 33.00 23.00 46.00 59.00 71.00 37.00 47.00 14.50
## [49] 70.50 32.50 12.00 9.00 36.50 51.00 55.50 40.50 44.00 1.00 61.00 56.00
## [61] 50.00 36.00 45.50 20.50 62.00 41.00 52.00 63.00 23.50 0.92 43.00 60.00
## [73] 10.00 64.00 13.00 48.00 0.75 53.00 57.00 80.00 70.00 24.50 6.00 0.67
## [85] 30.50 0.42 34.50 74.00
##
## $siblings_spouses
## [1] 1 0 3 4 2 5 8
##
## $parents_children
## [1] 0 1 2 5 3 4 6
##
## $fare
## [1] 7.2500 71.2833 7.9250 53.1000 8.0500 8.4583 51.8625 21.0750
## [9] 11.1333 30.0708 16.7000 26.5500 31.2750 7.8542 16.0000 29.1250
## [17] 13.0000 18.0000 7.2250 26.0000 8.0292 35.5000 31.3875 263.0000
## [25] 7.8792 7.8958 27.7208 146.5208 7.7500 10.5000 82.1708 52.0000
## [33] 7.2292 11.2417 9.4750 21.0000 41.5792 15.5000 21.6792 17.8000
## [41] 39.6875 7.8000 76.7292 61.9792 27.7500 46.9000 80.0000 83.4750
## [49] 27.9000 15.2458 8.1583 8.6625 73.5000 14.4542 56.4958 7.6500
## [57] 29.0000 12.4750 9.0000 9.5000 7.7875 47.1000 15.8500 34.3750
## [65] 61.1750 20.5750 34.6542 63.3583 23.0000 77.2875 8.6542 7.7750
## [73] 24.1500 9.8250 14.4583 247.5208 7.1417 22.3583 6.9750 7.0500
## [81] 14.5000 15.0458 26.2833 9.2167 79.2000 6.7500 11.5000 36.7500
## [89] 7.7958 12.5250 66.6000 7.3125 61.3792 7.7333 69.5500 16.1000
## [97] 15.7500 20.5250 55.0000 25.9250 33.5000 30.6958 25.4667 28.7125
## [105] 0.0000 15.0500 39.0000 22.0250 50.0000 8.4042 6.4958 10.4625
## [113] 18.7875 31.0000 113.2750 27.0000 76.2917 90.0000 9.3500 13.5000
## [121] 7.5500 26.2500 12.2750 7.1250 52.5542 20.2125 86.5000 512.3292
## [129] 79.6500 153.4625 135.6333 19.5000 29.7000 77.9583 20.2500 78.8500
## [137] 91.0792 12.8750 8.8500 151.5500 30.5000 23.2500 12.3500 110.8833
## [145] 108.9000 24.0000 56.9292 83.1583 262.3750 14.0000 164.8667 134.5000
## [153] 6.2375 57.9792 28.5000 133.6500 15.9000 9.2250 35.0000 75.2500
## [161] 69.3000 55.4417 211.5000 4.0125 227.5250 15.7417 7.7292 12.0000
## [169] 120.0000 12.6500 18.7500 6.8583 32.5000 7.8750 14.4000 55.9000
## [177] 8.1125 81.8583 19.2583 19.9667 89.1042 38.5000 7.7250 13.7917
## [185] 9.8375 7.0458 7.5208 12.2875 9.5875 49.5042 78.2667 15.1000
## [193] 7.6292 22.5250 26.2875 59.4000 7.4958 34.0208 93.5000 221.7792
## [201] 106.4250 49.5000 71.0000 13.8625 7.8292 39.6000 17.4000 51.4792
## [209] 26.3875 30.0000 40.1250 8.7125 15.0000 33.0000 42.4000 15.5500
## [217] 65.0000 32.3208 7.0542 8.4333 25.5875 9.8417 8.1375 10.1708
## [225] 211.3375 57.0000 13.4167 7.7417 9.4833 7.7375 8.3625 23.4500
## [233] 25.9292 8.6833 8.5167 7.8875 37.0042 6.4500 6.9500 8.3000
## [241] 6.4375 39.4000 14.1083 13.8583 50.4958 5.0000 9.8458 10.5167
##
## $embarked
## [1] S C Q
## Levels: C Q S
# even though the age column is stored as a double, most of its values are whole numbers; let's round each value to the nearest integer
titanic_data <- titanic_data %>% mutate(age = round(age))
unique(titanic_data$age)
## [1] 22 38 26 35 24 54 2 27 14 4 58 20 39 55 31 34 15 28 8 19 40 66 42 21 18
## [26] 3 7 49 29 65 5 11 45 17 32 16 25 1 30 33 23 46 59 71 37 47 70 12 9 36
## [51] 51 56 44 61 50 62 41 52 63 43 60 10 64 13 48 53 57 80 6 0 74
We reduced the number of unique values in the age column from 88 to 71. Another float column with a lot of unique values is the fare column. Let’s see whether the correlation with the survived column changes if we round these values.
titanic_data <- titanic_data %>% mutate(fare_rounded = round(fare))
cor(titanic_data$fare, titanic_data$survived)
cor(titanic_data$fare_rounded, titanic_data$survived)
titanic_data <- titanic_data %>% select(-fare) %>% rename(fare = fare_rounded)
length(unique(titanic_data$fare))
The correlation coefficient doesn’t get affected much and we reduced the number of unique values from 248 to 90!
The next step is to convert the sex column to categorical.
titanic_data$sex <- as.factor(titanic_data$sex)
Here we will focus on creating new columns. Is there a column that combines multiple pieces of information that could be split into several columns? Let’s look at the name column. What can we extract from it?
name
head(titanic_data$name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
It seems that the name column follows the format: last_name, title first_name(s) “nickname” (more_first_names?).
# first we split off the last name from the title and first names, which are separated by ", "
titanic_data <- titanic_data %>% mutate(title_and_first_names = (str_split(name, ", ", simplify = TRUE))[,2])
titanic_data <- titanic_data %>% mutate(title = str_split(title_and_first_names, " ", simplify = TRUE)[, 1])
titanic_data <- titanic_data %>% select(-title_and_first_names)
head(titanic_data$title)
## [1] "Mr." "Mrs." "Miss." "Mrs." "Mr." "Mr."
ggplot(titanic_data, aes(x = title)) + geom_bar()
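As an aside, the title could also be extracted in a single step with a regular expression (a sketch using stringr, which the tidyverse loads; the pattern matches the first run of letters followed by a period):

```r
library(stringr)

# first run of letters ending with a period, e.g. "Mr." or "Miss."
titles <- str_extract(c("Braund, Mr. Owen Harris",
                        "Heikkinen, Miss. Laina"),
                      "[A-Za-z]+\\.")
titles  # "Mr." "Miss."
```

For unusual names (e.g. titles preceded by other words), this regex may pick a different token than the split-based approach, so the two results should be cross-checked.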
Now that we have successfully extracted the title, let’s look at the values. We see that there are some dominant titles like Mr., Mrs., Miss., Master., and perhaps Dr. and Rev. The remaining titles are rare. To reduce the number of unique values, let’s group these rare titles into Other.
titanic_data <- titanic_data %>% mutate(
title = if_else(title %in% c("Mr.", "Mrs.", "Miss.", "Master.", "Dr.", "Rev."),
title,
"Other")
)
ggplot(titanic_data, aes(x = title)) + geom_bar()
# finally let's convert the title column to categorical
titanic_data$title <- as.factor(titanic_data$title)
str(titanic_data)
## tibble [891 × 12] (S3: tbl_df/tbl/data.frame)
## $ passenger_id : num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ passenger_class : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ age : num [1:891] 22 38 26 35 35 24 54 2 27 14 ...
## $ siblings_spouses: num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ parents_children: num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ title : Factor w/ 7 levels "Dr.","Master.",..: 4 5 3 5 4 4 4 2 5 5 ...
# let's identify outliers of the numerical columns
p1 <- ggplot(titanic_data, aes(x = age)) + geom_boxplot() + coord_flip()
p2 <- ggplot(titanic_data, aes(x = siblings_spouses)) + geom_boxplot() + coord_flip()
p3 <- ggplot(titanic_data, aes(x = parents_children)) + geom_boxplot() + coord_flip()
p4 <- ggplot(titanic_data, aes(x = fare)) + geom_boxplot() + coord_flip()
combined_plot <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4)
combined_plot
Discussion: how would you treat these outliers?
Common scaling techniques:
- min-max normalization: (x - min(x)) / (max(x) - min(x))
- z-score standardization: (x - mean(x)) / sd(x)
- robust scaling: (x - median(x)) / IQR(x)

We have 4 numerical columns: siblings_spouses, parents_children, age and fare. Even though the first two are quantitative, they have specific interpretations in their raw form, so we won’t normalize/standardize them. The age and fare columns, however, we will scale.
# min-max normalization of the age column
normalize <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
titanic_data <- titanic_data %>%
mutate(
age = normalize(age)
)
ggplot(titanic_data, aes(x = age)) + geom_histogram(bins = 30)
# robust scaling of the fare column
robust_scale <- function(x) {
(x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}
titanic_data <- titanic_data %>%
mutate(
fare = robust_scale(fare)
)
ggplot(titanic_data, aes(x = fare)) + geom_histogram(bins = 30)
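For completeness, the remaining option, z-score standardization ((x - mean(x)) / sd(x)), could be implemented the same way; a sketch (we do not apply it to the dataset here):

```r
# z-score standardization: result has mean 0 and standard deviation 1
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

standardize(c(1, 2, 3))  # -1 0 1
```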
Bivariate Analysis examines the relationship between two variables to understand their interactions and correlations. Techniques used include scatter plots, which visually depict the relationship between two variables, and correlation coefficients, which quantify the strength and direction of their linear relationship. Bivariate analysis helps in identifying patterns, associations, and potential predictive relationships between variables.
Correlation Analysis evaluates the strength and direction of the linear relationship between two variables. It is quantified using the correlation coefficient, which ranges from -1 to 1. A coefficient close to 1 indicates a strong positive relationship, close to -1 indicates a strong negative relationship, and around 0 suggests no linear relationship. Correlation analysis helps in understanding how changes in one variable might be associated with changes in another.
correlation_matrix <- round(cor(titanic_data %>% select(survived, passenger_class, age, siblings_spouses, parents_children, fare)), 2) # considering only numerical columns
melted_correlation_matrix <- melt(correlation_matrix)
ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) + geom_tile()
Scatter Plot is a graphical representation that displays the relationship between two continuous variables. Each point on the plot represents an observation, with its position determined by the values of the two variables. Scatter plots help visualize trends, correlations, and the distribution of data, making it easier to identify patterns, clusters, or outliers.
p1 <- ggplot(titanic_data, aes(x = passenger_class, y = survived)) +
geom_point(alpha = 0.5) +
labs(title = "Passenger class vs Survived")
p2 <- ggplot(titanic_data, aes(x = age, y = survived)) +
geom_point(alpha = 0.5) +
labs(title = "Age vs Survived")
p3 <- ggplot(titanic_data, aes(x = siblings_spouses, y = survived)) +
geom_point(alpha = 0.5) +
labs(title = "Number of siblings and spouses vs Survived")
p4 <- ggplot(titanic_data, aes(x = parents_children, y = survived)) +
geom_point(alpha = 0.5) +
labs(title = "Number of parents and children vs Survived")
p5 <- ggplot(titanic_data, aes(x = fare, y = survived)) +
geom_point(alpha = 0.5) +
labs(title = "Fare vs Survived")
# combine individual plots with patchwork
(p1 | p2 | p3) / (p4 | p5)
Pair Plot (or scatterplot matrix) is a grid of scatter plots that shows the relationships between all pairs of variables in a dataset. Each cell in the matrix is a scatter plot of two variables, while the diagonal typically shows the distribution of each variable. Pair plots provide a comprehensive view of the interactions between variables, helping to identify correlations and patterns across multiple dimensions.
# select relevant columns
numeric_columns <- titanic_data %>% dplyr::select(age, fare, siblings_spouses, parents_children, survived)
# create a pair plot
ggpairs(numeric_columns) +
labs(title = "Pair Plot of Selected Titanic Dataset Variables")
Multivariate analysis enables you to uncover more complex patterns in your data by exploring relationships between multiple variables simultaneously.
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a set of orthogonal components (principal components) that capture the most variance in the data. By projecting the data onto these components, PCA reduces the number of variables while preserving as much of the original variability as possible. This is useful for simplifying datasets, revealing underlying structures, and visualizing high-dimensional data in lower dimensions.
# select numeric columns for PCA
numeric_columns <- titanic_data %>%
dplyr::select(age, fare, siblings_spouses, parents_children, survived) %>%
na.omit() # remove rows with missing values
# perform PCA
pca_result <- prcomp(numeric_columns, scale. = TRUE)
pca_scores <- as.data.frame(pca_result$x)
# add the survived column to the pca scores
pca_scores$survived <- numeric_columns$survived
ggplot(pca_scores, aes(x = PC1, y = PC2)) +
geom_point(alpha = 0.7, aes(color = survived)) +
labs(title = "PC1 vs PC2", x = "PC1", y = "PC2") +
theme_minimal()
ggplot(pca_scores, aes(x = PC2, y = PC3)) +
geom_point(alpha = 0.7, aes(color = survived)) +
labs(title = "PC2 vs PC3", x = "PC2", y = "PC3") +
theme_minimal()
ggplot(pca_scores, aes(x = PC1, y = PC3)) +
geom_point(alpha = 0.7, aes(color = survived)) +
labs(title = "PC1 vs PC3", x = "PC1", y = "PC3") +
theme_minimal()
Clustering Analysis involves grouping data points into clusters such that points within the same cluster are more similar to each other than to those in other clusters. The goal is to identify natural groupings in the data. Common methods include k-means clustering, which partitions data into a predefined number of clusters by minimizing the distance between points and their cluster centroids, and hierarchical clustering, which builds a hierarchy of clusters through either iterative merging (agglomerative) or splitting (divisive). Clustering helps in discovering patterns and segmenting data for further analysis.
# standardize or normalize the numerical features to ensure they are on a similar scale
scaled_data <- scale(numeric_columns)
K-means clustering partitions a dataset into k clusters by iteratively assigning data points to the nearest centroid and updating the centroids based on the mean of assigned points. The process continues until the centroids stabilize or a set number of iterations is reached. It is efficient for large datasets but requires specifying k and can be sensitive to initial centroid placement and outliers.
# use methods like Elbow Method or Silhouette Analysis to decide the optimal number of clusters
fviz_nbclust(scaled_data, kmeans, method = "wss")
set.seed(123) # for reproducibility
# perform k-means clustering
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)
# add cluster assignments to the original data
titanic_data$cluster <- kmeans_result$cluster
# visualize clusters
ggplot(titanic_data, aes(x = age, y = fare, color = as.factor(cluster))) +
geom_point(alpha = 0.6) +
labs(title = "K-means Clustering", color = "Cluster") +
theme_minimal()
Hierarchical clustering creates a tree-like structure of clusters called a dendrogram. Agglomerative hierarchical clustering starts with each point as its own cluster and merges the closest clusters iteratively. Divisive hierarchical clustering starts with one large cluster and splits it iteratively. It doesn’t require specifying the number of clusters beforehand but can be computationally intensive for large datasets and does not allow for reassigning data points once clusters are formed.
# compute distance matrix
dist_matrix <- dist(scaled_data)
# perform hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# cut tree to get clusters
clusters_hc <- cutree(hclust_result, k = 3)
# add cluster assignments to the original data
titanic_data$cluster_hc <- clusters_hc
# plot dendrogram
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", labels = FALSE)
ggplot(titanic_data, aes(x = age, y = fare, color = as.factor(cluster_hc))) +
geom_point(alpha = 0.6) +
labs(title = "Hierarchical Clustering", color = "Cluster") +
theme_minimal()
We have already covered principal component analysis above.
ggplot(pca_scores, aes(x = PC1, y = PC2)) +
geom_point(alpha = 0.7, aes(color = survived)) +
labs(title = "PC1 vs PC2", x = "PC1", y = "PC2") +
theme_minimal()
Other options for dimensionality reduction are t-SNE, UMAP or MDS.
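Of these, classical MDS is the easiest to try, since it ships with base R as cmdscale(); a sketch reusing the dist_matrix and numeric_columns objects from the clustering section (t-SNE and UMAP would require extra packages such as Rtsne or uwot):

```r
# classical multidimensional scaling into 2 dimensions
mds_coords <- as.data.frame(cmdscale(dist_matrix, k = 2))
colnames(mds_coords) <- c("dim1", "dim2")
mds_coords$survived <- numeric_columns$survived

ggplot(mds_coords, aes(x = dim1, y = dim2, color = survived)) +
  geom_point(alpha = 0.6) +
  labs(title = "Classical MDS of the scaled numeric columns") +
  theme_minimal()
```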